import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline
sns.set()
train_X = pd.read_csv('../data/emod3d_train_x.csv')
train_X
train_y = pd.read_csv('../data/emod3d_train_y.csv')
train_y
train = pd.concat([train_X, train_y], axis=1)
train
train_samp = train.sample(frac=0.05)
train_samp
sns.pairplot(train_samp)
n_sub is generally always < 0.5 and doesn't appear to be highly correlated with core_hours. It's so skewed that it may be worth discarding anything greater, as that might throw off our predictions. The number of dots that lie above n_sub > 0.5 eyeballs to be similar to the number of dots that could be considered outliers in the other plots. Perhaps our prediction could be a hybrid: one model when n_sub is typically low, and another (possibly just fudging in a higher core-hours budget) when it's higher.
nx and ny, presumably inputs for some search-space, tend to grow with each other, such that the simulated area tends to be somewhere between a square and a rectangle of ratio 1:2 or so.
nz is interesting in that it presumably indicates that studies are done at one of two depths or depth-resolutions... we don't know what nz represents.
Other thoughts: some jobs have very low core_hours and were aborted.
core_hours
It seems that there is a large spike in the number of core_hours near 0; this may be due to jobs that failed to start. Values that are exactly 0 will not be real jobs, but jobs that do run for a short period of time could still be valid (or might fail for some other reason).
sns.distplot(train_samp['core_hours'], bins=100)
outliers_too_low = train[ train['core_hours'] < 0.2 ]
outliers_too_low
sns.distplot(outliers_too_low['core_hours'])
train.drop(outliers_too_low.index, inplace=True)
train
n_sub in the pairwise plots
n_sub is a continuous value, and we would like to identify all those points in the pairwise plots that come from n_sub being high (or low), in case that makes it easier to identify the cause of outliers.
Seaborn's pairplot accepts a hue parameter: give it the name of a column in the data-frame and, for each value, it will break those out as a separate series, but it really needs to be a categorical data type.
That means we need to (the nomenclature varies) 'discretize' or 'bin' the values.
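Before binning the real column, here is a toy illustration (made-up values, not the project data) of what pd.qcut produces: an ordered categorical with equal-count bins.

```python
import pandas as pd

# Toy series: 100 evenly spread values (made-up data)
s = pd.Series(range(100))
deciles = pd.qcut(s, q=10)

print(deciles.dtype)                        # an ordered categorical dtype
print(deciles.value_counts().sort_index())  # ten bins of 10 values each
```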
train['n_sub_deciles'] = pd.qcut(train['n_sub'], q=10, precision=0)
train_samp['n_sub_deciles'] = pd.qcut(train_samp['n_sub'], q=10, precision=0)
sns.pairplot(train_samp,
             vars=['nt', 'nx', 'ny', 'nz', 'n_sub', 'n_cores', 'core_hours'],
             hue='n_sub_deciles')
n_cores in the pairplots
Smaller jobs should correlate with fewer cores being requested, which should reflect the user's expectation of how large the job is.
sns.pairplot(train_samp,
             vars=['nt', 'nx', 'ny', 'nz', 'n_sub', 'core_hours'],
             hue='n_cores')
The complexity of the simulation would appear to be some product of nt (presumably the number of time samples) and nx, ny and nz (presumably the granularity of the 3D simulation). If we use a regression method such as Multiple Regression, then we end up trying to find values for $\alpha$ and the coefficients $\beta_i$ in the following:

$$ \alpha + \beta_1\,nt + \beta_2\,nx + \beta_3\,ny + \beta_4\,nz + \beta_5\,n\_sub + \beta_6\,n\_cores $$

But if we see an apparent exponential or quadratic type of complexity in time, then it might be better to calculate a new column, such that we get:

$$ ntxyz = nt \times nx \times ny \times nz $$

$$ \alpha + \beta_1\,ntxyz + \beta_2\,n\_sub + \beta_3\,n\_cores $$

train_samp['ntxyz'] = train_samp['nt'] * train_samp['nx'] * train_samp['ny'] * train_samp['nz']
train['ntxyz'] = train['nt'] * train['nx'] * train['ny'] * train['nz']
sns.pairplot(train_samp, vars=['ntxyz', 'n_sub', 'n_cores', 'core_hours'], hue='n_sub_deciles')
# https://seaborn.pydata.org/tutorial/regression.html
sns.lmplot(
    data=train,
    x='ntxyz',
    y='core_hours',
    hue='n_sub_deciles',
    ci=None,
    height=10,
    scatter_kws={"alpha": .1},
    order=2)
# Ah alpha, the cheaty statistician's way of expressing localised modality :)
Aha, so that's very interesting. It would appear that n_sub, like n_cores, reflects the user's expectation of job size and therefore runtime. This job is not 'embarrassingly parallel' in nature; we can see that as n_sub increases, overheads such as communication and lock contention work to slow down the system. This presumably doesn't scale well in that final decile, where there is some very erratic performance. However, the final decile for n_sub is very wide compared to the others, which could also mean that, for example, the 90th percentile is okay-ish but the 99th is unusable.
Another reason that may explain the erratic behaviour of n_sub would be algorithmic changes: it's possible that the code once exhibited exponential complexity, and this was then improved to get closer to linear performance. That could explain why there are a few data-points in the first decile that follow the same sort of exponential curve as the last decile. If that were indeed the case, we could drop them; we'd be unlikely to need to predict performance for an old version of the software.
What is that outlier in the top-left? If I colour the hue using either n_sub or n_cores I can't make any sense of it, even subdividing the n_sub bins over one million... that will be easier if I show you instead.
n_sub_bin_cuts=[0, 10_000, 100_000, 500_000, 1_000_000, 2_000_000, 3_000_000]
n_sub_bin_labels=['<10k', '<100k', '<500k', '<1M', '<2M', '<3M']
train['n_sub_bins'] = pd.cut(train['n_sub'],
                             bins=n_sub_bin_cuts, labels=n_sub_bin_labels)
plt.figure(figsize=(15,10))
sns.scatterplot(
    data=train,
    x='ntxyz',
    y='core_hours',
    hue='n_sub_bins',
    hue_order=n_sub_bin_labels,
    alpha=0.1
).legend(loc='center left', bbox_to_anchor=(1.02, 0.5), ncol=1)
# https://www.drawingfromdata.com/setting-figure-size-using-seaborn-and-matplotlib
The core_hours in the training data is apparently CPU time, but when the underlying cluster filesystem is busy it can have an impact (which makes me think that this reported time is really wall-clock hours × num_cores)...
...but then again, it is in a series of n_sub that clearly has astronomically bad performance, worse even than the 3-4 million range of n_sub... I'm just going to assume that maybe that's an artefact from an older version of the code.
Let's see the raw data that exists around that point...
the_oddity = train[ ( train['core_hours'] > 1000 ) & ( train['n_sub_bins'] == '<3M' ) ][['nt', 'nx', 'ny', 'nz', 'n_sub', 'n_cores', 'core_hours']]
the_oddity
Interesting, but perhaps not surprising; the same output was arrived at from the same inputs. But is it true (in this case) that the same inputs always produce the same outputs? That would give us some good insight into external factors influencing the results.
train[ ( train['nt'] == 11408.0 ) & ( train['nx'] == 1208.0 ) & ( train['ny'] == 1831.0 ) & ( train['nz'] == 150.0 ) & ( train['n_sub'] == 2035075 ) & ( train['n_cores'] == 320.0 ) ][['nt', 'nx', 'ny', 'nz', 'n_sub', 'n_cores', 'core_hours']]
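A more direct way to ask "do identical inputs always give identical outputs?" is to group by the input columns and count distinct outputs per group; a sketch on made-up data with matching column names:

```python
import pandas as pd

# Made-up frame: two input columns, one output column
df = pd.DataFrame({
    'nt': [100, 100, 100, 200],
    'nx': [10, 10, 10, 20],
    'core_hours': [5.0, 5.0, 7.5, 3.0],
})

# Count distinct outputs per unique input combination;
# anything > 1 means identical inputs gave different outputs
distinct = df.groupby(['nt', 'nx'])['core_hours'].nunique()
nondeterministic = distinct[distinct > 1]
print(nondeterministic)
```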
Here's a bit of an experiment: I'm going to create a fingerprint of each data-point based on rounding each input value, then group by that fingerprint, and perform a scatterplot with the dot size being the standard deviation of the group. It might show something interesting.
An equivalent Unix command-line to create a similar fingerprint data-set (not that I'm going to use it):
paste -d, emod3d_train_x.csv emod3d_train_y.csv | awk -F, 'NR==1{next} { printf("%d_%d_%d_%d_%d_%d,%f\n", int($1/1000)*1000, int($2/100)*100, int($3/100)*100, int($4/100)*100, int($5/1000)*1000, int($6), $7)}' > emod3d_train_fingerprint.csv
# This part really just shows how to use 'apply' to iterate a function
# over a row.
def fingerprint(row):
    return "{ntr}_{nxr}_{nyr}_{nzr}_{nsr}_{n_cores}".format(
        ntr=int(row['nt'] / 1000) * 1000,
        nxr=int(row['nx'] / 100) * 100,
        nyr=int(row['ny'] / 100) * 100,
        nzr=int(row['nz'] / 100) * 100,
        nsr=int(row['n_sub'] / 1000) * 1000,
        n_cores=int(row['n_cores']))
train['fingerprint'] = train.apply(fingerprint, axis=1)
train[['nt', 'nx', 'ny', 'nz', 'n_sub', 'n_cores', 'fingerprint']].head()
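As an aside, the same fingerprint can be built without apply, using vectorised floor-division (which matches int() truncation for these non-negative values); a sketch on one made-up row:

```python
import pandas as pd

# One made-up row matching the training columns
df = pd.DataFrame({'nt': [11408.0], 'nx': [1208.0], 'ny': [1831.0],
                   'nz': [150.0], 'n_sub': [2035075], 'n_cores': [320.0]})

# Floor each column to its bucket size, then join as strings
fp = (((df['nt'] // 1000) * 1000).astype(int).astype(str)
      + '_' + ((df['nx'] // 100) * 100).astype(int).astype(str)
      + '_' + ((df['ny'] // 100) * 100).astype(int).astype(str)
      + '_' + ((df['nz'] // 100) * 100).astype(int).astype(str)
      + '_' + ((df['n_sub'] // 1000) * 1000).astype(int).astype(str)
      + '_' + df['n_cores'].astype(int).astype(str))
print(fp.iloc[0])  # 11000_1200_1800_100_2035000_320
```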
# this is probably a terrible example of using groupby
predictability_df = pd.DataFrame(
    [[group['ntxyz'].iat[0],
      group['core_hours'].mean(),
      group['core_hours'].std(),
      name]
     for name, group in train.groupby('fingerprint')],
    columns=['ntxyz', 'core_hours', 'core_hours_fingerprint_std', 'fingerprint']).dropna()
predictability_df
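The list comprehension above can be written more idiomatically with groupby().agg() using named aggregation; a sketch on made-up data with the same column names:

```python
import pandas as pd

# Toy frame with the same column names as above (made-up values)
df = pd.DataFrame({
    'fingerprint': ['a', 'a', 'b', 'b'],
    'ntxyz': [1.0, 1.0, 2.0, 2.0],
    'core_hours': [10.0, 12.0, 5.0, 5.0],
})

# One row per fingerprint: first ntxyz, mean and std of core_hours
predictability = (df.groupby('fingerprint')
                    .agg(ntxyz=('ntxyz', 'first'),
                         core_hours=('core_hours', 'mean'),
                         core_hours_fingerprint_std=('core_hours', 'std'))
                    .reset_index()
                    .dropna())
print(predictability)
```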
# https://seaborn.pydata.org/generated/seaborn.scatterplot.html
plt.figure(figsize=(10,7))
p = sns.scatterplot(
    data=predictability_df,
    x='ntxyz',
    y='core_hours',
    size='core_hours_fingerprint_std',
    sizes=(10, 200),
    hue='core_hours_fingerprint_std',
    palette=sns.cubehelix_palette(dark=.3, light=.8, as_cmap=True))
p.legend(loc='center left', bbox_to_anchor=(1.02, 0.5), ncol=1)
for i in predictability_df[predictability_df['core_hours_fingerprint_std'] > 50.0].index:
    p.text(
        predictability_df['ntxyz'][i] + 0.03e13,
        predictability_df['core_hours'][i],
        "{:.1f}".format(predictability_df['core_hours_fingerprint_std'][i]),
        horizontalalignment='left', size='medium', color='black', weight='normal')
plt.title('Standard Deviation of core_hours for Neighbouring Inputs')
plt.show()
So... something has happened to create some large outliers which cannot be accounted for in the input. Oddly enough, some of these outliers indicate that the program ran much faster than normal, as well as some much worse, which might be indicative of system-level issues, or potentially faulty versions of the software (not captured in the input).
We could just rip those out; the question would then be at which threshold we should remove those data-points. 65 is not much when you consider we would still be overestimating for our runtime prediction, but 447 would be completely out. 85 seems to be legitimate for the performance curve we have observed. 241 is similarly very significant, particularly when its location on the Y axis is 200; similarly the 140. So I'm comfortable in this case making the threshold 100.
To do this, we look at the training data and drop all rows with a convicted fingerprint. Hence, it's useful to make fingerprints reasonably small so as not to convict other data-points which may be innocent, but you still need enough in each fingerprint to have a useful sample for the standard deviation.
train.drop(train[train.fingerprint.isin(
    predictability_df[predictability_df.core_hours_fingerprint_std > 100].fingerprint
)].index, inplace=True)
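The isin()-based drop pattern, demonstrated on made-up data to show its semantics:

```python
import pandas as pd

# Toy frame: rows tagged with fingerprints (made-up values)
df = pd.DataFrame({'fingerprint': ['a', 'b', 'a', 'c'],
                   'core_hours': [1.0, 2.0, 3.0, 4.0]})
convicted = pd.Series(['a'])  # fingerprints over the std threshold

# Drop every row whose fingerprint is in the convicted set
df.drop(df[df.fingerprint.isin(convicted)].index, inplace=True)
print(df.fingerprint.tolist())  # ['b', 'c']
```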
Now let's see what our data looks like, using the same plot as used previously:
plt.figure(figsize=(15,10))
sns.scatterplot(
    data=train,
    x='ntxyz',
    y='core_hours',
    hue='n_sub_bins',
    hue_order=n_sub_bin_labels,
    alpha=0.1
).legend(loc='center left', bbox_to_anchor=(1.02, 0.5), ncol=1)
Great stuff; that's looking much cleaner. The Big Red Dot is still significant, but it may be too disingenuous to remove it.
n_sub in the test data?
Seems that we can compare it: the distribution looks to be very similar; perhaps the test set was defined using a modulo-N of the total data. This would be bad, because it then very likely includes the same system noise that produced the outlier fingerprints.
test_x = pd.read_csv('../data/emod3d_test_x.csv')
fig, ax = plt.subplots(1,3)
fig.set_figwidth(15)
sns.distplot(test_x['n_sub'], bins=100, ax=ax[0])
sns.distplot(train['n_sub'], bins=100, ax=ax[1])
# See how close they really are; the taller the spike at 0, the fewer the differences.
# Note: the subtraction aligns on index, so rows dropped from train become NaN.
sns.distplot((test_x['n_sub'] - train['n_sub']).dropna(), bins=100, ax=ax[2])
fig.show()
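The elementwise subtraction is only a rough check; a more principled comparison of two distributions (assuming SciPy is available) is a two-sample Kolmogorov-Smirnov test, sketched here on synthetic samples:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
a = rng.normal(0, 1, 1000)
b = rng.normal(0, 1, 1000)  # same distribution as a
c = rng.normal(2, 1, 1000)  # shifted distribution

stat_same, p_same = stats.ks_2samp(a, b)
stat_diff, p_diff = stats.ks_2samp(a, c)
# Expect a small statistic / large p-value for a vs b,
# and a large statistic / tiny p-value for a vs c
print(stat_same, p_same)
print(stat_diff, p_diff)
```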
Given how predictably the program performs with n_sub < 10k, we can certainly minimise the loss there, and we've seen that it behaves nicely and linearly; < 500k could be treated much the same as < 1M, although a 2nd-order polynomial will be our friend there, and we might want to add a safety factor based on n_sub.
# https://towardsdatascience.com/linear-regression-in-6-lines-of-python-5e1d0cd05b8d
from sklearn.linear_model import LinearRegression
n_sub_bin='<10k'
# Get the data for this n_sub partition
series = train[ train.n_sub_bins == n_sub_bin ]
# https://pandas.pydata.org/docs/user_guide/reshaping.html
X = series.ntxyz.values.reshape(-1, 1)
Y = series.core_hours.values.reshape(-1, 1)
# Fit the model
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
# Make predictions to visualise
Y_pred = linear_regressor.predict(X)
# Show the input points with the regressed model
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()
# for n_sub_bin in train.n_sub_bins.dropna().unique():
n_sub_bin='<500k'
# Get the data for this n_sub partition
series = train[ train.n_sub_bins == n_sub_bin ]
# https://pandas.pydata.org/docs/user_guide/reshaping.html
X = series.ntxyz.values.reshape(-1, 1)
Y = series.core_hours.values.reshape(-1, 1)
# Fit the model
linear_regressor = LinearRegression()
linear_regressor.fit(X, Y)
# Make predictions to visualise
Y_pred = linear_regressor.predict(X)
# Show the input points with the regressed model
plt.scatter(X, Y)
plt.plot(X, Y_pred, color='red')
plt.show()
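For the bins where the curve looks 2nd-order, the same LinearRegression can be reused on polynomial features; a sketch on synthetic quadratic data (not the project data):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data following y = 2x^2 + 3x + 1 (made-up values)
X = np.linspace(0, 10, 50).reshape(-1, 1)
Y = 2 * X ** 2 + 3 * X + 1

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)  # columns: 1, x, x^2

model = LinearRegression()
model.fit(X_poly, Y)

# An exact quadratic is recovered exactly (up to floating point)
pred = model.predict(poly.transform([[5.0]]))
print(pred)  # close to 2*25 + 3*5 + 1 = 66
```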
import statsmodels.api as sm
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
train_X[['nt','nx','ny','nz','n_sub','n_cores']]
# scale.fit_transform(train_X[['nt','nx','ny','nz','n_sub','n_cores']])
# (fit_transform accepts a DataFrame directly; .as_matrix() was removed from pandas)
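As a reminder of what StandardScaler's fit_transform does, a toy sketch on made-up values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy matrix: two columns on very different scales (made-up values)
data = np.array([[1.0, 10.0],
                 [2.0, 20.0],
                 [3.0, 30.0]])

scaler = StandardScaler()
scaled = scaler.fit_transform(data)

# Each column is rescaled to zero mean and unit variance
print(scaled.mean(axis=0))  # ~[0. 0.]
print(scaled.std(axis=0))   # ~[1. 1.]
```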